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Abstract 

We propose a class of nonparametric two-sample tests with a cost linear in the 
sample size. Two tests are given, both based on an ensemble of distances be¬ 
tween analytic functions representing each of the distributions. The first test uses 
smoothed empirical characteristic functions to represent the distributions, the sec¬ 
ond uses distribution embeddings in a reproducing kernel Hilbert space. Analytic- 
ity implies that differences in the distributions may be detected almost surely at a 
finite number of randomly chosen locations/frequencies. The new tests are consis¬ 
tent against a larger class of alternatives than the previous linear-time tests based 
on the (non-smoothed) empirical characteristic functions, while being much faster 
than the current state-of-the-art quadratic-time kernel-based or energy distance- 
based tests. Experiments on artificial benchmarks and on challenging real-world 
testing problems demonstrate that our tests give a better power/time tradeoff than 
competing approaches, and in some cases, better outright power than even the 
most expensive quadratic-time tests. This performance advantage is retained even 
in high dimensions, and in cases where the difference in distributions is not ob¬ 
servable with low order statistics. 


1 Introduction 

Testing whether two random variables are identically distributed without imposing any parametric 
assumptions on their distributions is important in a variety of scientific applications. These include 
data integration in bioinformatics 0, benchmarking for steganography m and automated model 
checking EH. Such problems are addressed in the statistics literature via two-sample tests (also 
known as homogeneity tests). 

Traditional approaches to two-sample testing are based on distances between representations of the 
distributions, such as density functions, cumulative distribution functions, characteristic functions or 
mean embeddings in a reproducing kernel Hilbert space (RKHS) Il2^l25l . These representations are 
infinite dimensional objects, which poses challenges when defining a distance between distributions. 
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Examples of such distances include the classical Kolmogorov-Smirnov distance (sup-norm between 
cumulative distribution functions); the Maximum Mean Discrepancy (MMD) [Sl, an RKHS norm 
of the difference between mean embeddings, and the N-distance (also known as energy distance) 
|I33][30][3, which is an MMD-based test for a particular family of kernels ll24l . Tests may also be 
based on quantities other than distances, an example being the Kernel Fisher Discriminant (KFD) 
CD, the estimation of which still requires calculating the RKHS norm of a difference of mean 
embeddings, with normalization by an inverse covariance operator. 

In contrast to consistent two-sample tests, heuristics based on pseudo-distances, such as the dif¬ 
ference between characteristic functions evaluated at a single frequency, have been studied in the 
context of goodness-of-fit tests CD HD- It was shown that the power of such tests can be maximized 
against fully specified alternative hypotheses, where test power is the probability of correctly reject¬ 
ing the null hypothesis that the distributions are the same. In other words, if the class of distributions 
being distinguished is known in advance, then the tests can focus only at those particular frequen¬ 
cies where the characteristic functions differ most. This approach was generalized to evaluating the 
empirical characteristic functions at multiple distinct frequencies by IT), thus improving on tests 
that need to know the single “best” frequency in advance (the cost remains linear in the sample size, 
albeit with a larger constant). This approach still fails to solve the consistency problem, however: 
two distinct characteristic functions can agree on an interval, and if the tested frequencies fall in that 
interval, the distributions will be indistinguishable. 

In Section of the present work, we introduce two novel distances between distributions, which 
both use a parsimonious representation of the probability measures. The first distance builds on 
the notion of differences in characteristic functions with the introduction of smooth characteristic 
functions, which can be though as the analytic analogues of the characteristics functions. A distance 
between smooth characteristic functions evaluated at a single random frequency is almost surely a 
distance (Definition[T]formalizes this concept) between these two distributions. In other words, there 
is no need to calculate the whole infinite dimensional representation - it is almost surely sufficient 
to evaluate it at a single random frequency (although checking more frequencies will generally 
result in more powerful tests). The second distance is based on analytic mean embeddings of two 
distributions in a characteristic RKHS; again, it is sufficient to evaluate the distance between mean 
embeddings at a single randomly chosen point to obtain almost surely a distance. To our knowledge, 
this representation is the first mapping of the space of probability measures into a finite dimensional 
Euclidean space (in the simplest case, the real line) that is almost surely an injection, and as a result 
almost surely a metrization. This metrization is very appealing from a computational viewpoint, 
since the statistics based on it have linear time complexity (in the number of samples) and constant 
memory requirements. 

We construct statistical tests in Section]^ based on empirical estimates of differences in the analytic 
representations of the two distributions. Our tests have a number of theoretical and computational 
advantages over previous approaches. The test based on differences between analytic mean embed¬ 
dings is a.s. consistent for all distributions, and the test based on differences between smoothed 
characteristic functions is a.s. consistent for all distributions with integrable characteristic func¬ 
tions (contrast with [7], which is only consistent under much more onerous conditions, as discussed 
above). This same weakness was used by m in justifying a test that integrates over the entire fre¬ 
quency domain (albeit at cost quadratic in the sample size), for which the quadratic-time MMD is 
a generalization m. Compared with such quadratic time tests, our tests can be conducted in linear 
time - hence, we expect their power/computation tradeoff to be superior. 

We provide several experimental benchmarks (Section]^ for our tests. First, we compare test power 
as a function of computation time for two real-life testing settings; amplitude modulated audio 
samples, and the Higgs dataset, which are both challenging multivariate testing problems. Our 
tests give a better power/computation tradeoff than the characteristic function-based tests of Q, 
the previous sub-quadratic-time MMD tests CniED, and the quadratic-time MMD test. In terms 
of power when unlimited computation time is available, we might expect worse performance for 
the new tests, in line with findings for linear- and sub-quadratic-time MMD-based tests IHIlKIol 
ED. Remarkably, such a loss of power is not the rule; for instance, when distinguishing signatures 
of the Higgs boson from background noise m (’Higgs dataset’), we observe that a test based on 
differences in smoothed empirical characteristic functions outperforms the quadratic-time MMD. 
This is in contrast to linear- and sub-quadratic-time MMD-based tests, which by construction are 
less powerful than quadratic-time MMD. Next, for challenging artificial data (both high-dimensional 
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distributions, and distributions for which the difference is very subtle), our tests again give a better 
power/computation tradeoff than competing methods. 

2 Analytic embeddings and distances 


In this section we consider mappings from the space of probability measures into a sub-space of 
real valued analytic functions. We will show that evaluating these maps at J randomly selected 
points is almost surely injective for any J > 0. Using this result, we obtain a simple (randomized) 
metrization of the space of probability measures. This metrization is used in the next section to 
construct linear-time nonparametric two-sample tests. 

To motivate our approach, we begin by recalling an integral family of distances between distribu¬ 
tions, denoted Maximum Mean Discrepancies (MMD) |0. The MMD is defined as 


MMD(P, Q) = sup 
f^Bu 



( 1 ) 


where P and Q are probability measures on E, and B^. is the unit ball in the RKHS associated 
with a positive definite kernel k : E x E ^ H. A popular choice of k is the Gaussian kernel 
k{x,y) = exp(—IIX — 2 /IP/ 7 ^) with bandwidth parameter 7 . It can be shown that the MMD is equal 
to the RKHS distance between so called mean embeddings, 

MMD{P,Q) = \\^,p-^lQ\\H„ (2) 

where pp is an embedding of the probability measure P to Hk, 

Mp(t) = [ k{x,t)dP{x), (3) 

J E 

and II • IIpj, denotes the norm in the RKHS Hk- When k is translation invariant, i.e., k {x,y) = 
k{x — y), the squared MMD can be written ll26l Corollary 4] as 


MMD2(P,Q)=/’ \ippit)-ipQit)\^p-^pit)dt, (4) 

where F denotes the Fourier transform, F~^ is the inverse Fourier transform, and ipp, ipq are the 
characteristic functions of P, Q, respectively. From ll^ Theorem 9], a kernel is called character¬ 
istic when 

MMD(P,Q) = OiffP = Q. (5) 

Any bounded, continuous, translation-invariant kernel whose inverse Fourier transform is almost 
everywhere non-zero is characteristic f26li . By representation ([^, it is clear that MMD with a char¬ 
acteristic kernel is a metric. 


Pseudometrics based on characteristic functions. A practical limitation when using the MMD in 
testing is that empirical estimates are expensive to compute, these being the sum of two U-statistics 
and an empirical average, with cost quadratic in the sample size. We might instead consider a finite 
dimensional approximation to the MMD, achieved by estimating the integral Q, with the random 
variable 

1 

dl,AP,Q) = - ^ |^p(T,) - (6) 

i=i 

where are sampled independently from the distribution with a density function F~^k. This 

type of approximation is applied to various kernel algorithms under the name of random Fourier 
features ll20l[T6ll . In the statistical testing literature, the quantity d^j{P, Q) predates the MMD by 
a considerable time, and was studied in ifT^ fTJl iTl. and more recently revisited in ll32l . Our first 
proposition is that d^ j{P, Q) can be a poor choice of distance between probability measures, as it 
fails to distinguish a large class of measures. The following result is proved in the Appendix. 

Proposition 1. Let J S N and let be a sequence of real valued i.i.d. random variables 

with a distribution which is absolutely continuous with respect to the Lebesgue measure. For any 
e > 0 , there exists an uncountable set A of mutually distinct probability measures (on the real line) 
such that for any P, Q € A, P j{P, Q) = O) > 1 — e. 
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We are therefore motivated to find distances of the form (j^ that can distinguish larger classes of 
distributions, yet remain efficient to compute. These distances are characterized as follows: 

Definition 1 (Random Metric). A random process d with the values in R, indexed with pairs from 
the set of probability measures A4 

d = {d{P,Q) : P,Q e M} 

is said to be a random metric if it satisfies all the conditions for a metric with qualification ‘almost 
surely’. Formally, for all P,Q, R G M, random variables d{P, Q), d{P, R), d{R, Q) must satisfy 

1. d{P, Q) > 0 a.s. 

2. if P = Q, then d{P, Q) — 0 a.s, if P f Q then d{P, Q) = 0 a.s. 

3. d{P, Q) = d{Q, P) a.s. 

4. d{P, Q) < d{P, R) + d{R, Q) a.s. [] 

From the statistical testing point of view, the coincidence axiom of a metric d, d{P, Q) = 0 if and 
only if P = Q, is key, as it ensures consistency against all alternatives. The quantity d^j{P, Q) in 
(j^ violates the coincidence axiom, so it is only a random pseudometric (other axioms are trivially 
satisfied). We remedy this problem by replacing the characteristic functions by smooth characteristic 
functions: 

Definition 2. A smooth characteristic function 4>p{t) of a measure P is a characteristic function of 
P convolved with an analytic smoothing kernel I, i.e. 

4’pit) = / pp{w)l{t — w)dw, f e (7) 

The analogue of d^j{P, Q) for smooth characteristic functions is simply 

1 

dlAAQ) = (8) 

where are sampled independently from the absolutely continuous distribution (returning 

to our earlier example, this might be F~^K,(t) if we believe this to be an informative choice). The 
following theorem, proved in the Appendix, demonstrates that the smoothing greatly increases the 
class of distributions we can distinguish. 

Theorem 1. Let I be an analytic, integrable kernel with an inverse Fourier transform strictly greater 
than zero. Then, for any J > 0, is a random metric on the space of probability measures with 
integrable characteristic functions, and fp is an analytic function. 

This result is primarily a consequence of analyticity of smooth characteristic functions and the fact 
that analytic functions are ’well behaved’. There is an additional, practical advantage to smoothing: 
when the variability in the difference of the characteristic functions is high, and these differences are 
local, smoothing distributes the difference in CFs more evenly in the frequency domain (a simple 
illustration is in Fig. |A.1[ Appendix), making them easier to find by measurement at a small number 
of randomly chosen points. This accounts for the observed improvements in test power in Section 
1^ over differences in unsmoothed CFs. 

Metrics based on mean embeddings. The key step which led us to the construction of a random 
metric d^j is the convolution of the original characteristic functions with an analytic smoothing ker¬ 
nel. This idea need not be restricted to the representations of probability measures in the frequency 
domain. We may instead directly convolve the probability measure with a positive definite kernel k 
(that need not be translation invariant), yielding its mean embedding into the associated RKHS, 

dp{t) = [ k{x,t)dP{x). (9) 

JE 

* Note that this does not imply that realizations of d are distances on A4, but it does imply that they are 
almost surely distances for all arbitrary finite subsets of M. 
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We say that a positive definite kernel k : x R is analytic on its domain if for all x G R^, 

the feature map k{x, •) is an analytic function on R^. By using embeddings with characteristic and 
analytic kernels, we obtain particularly useful representations of distributions. As for the smoothed 
CF case, we define 

1 

dlAP, Q) = -j - liQiTAf- ( 10 ) 

The following theorem ensures that Q) is also a random metric. 

Theorem 2. Let k be an analytic, integrable and characteristic kernel. Then for any <7 > 0, j is 
a random metric on the space of probability measures (and p,p is an analytic function). 

Note that this result is stronger than the one presented in Theorem since is is not restricted to the 
class of probability measures with integrable characteristic functions. Indeed, the assumption that 
the characteristic function is integrable implies the existence and boundedness of a density. Recall¬ 
ing the representation of MMD in Q, we have proved that it is almost always sufficient to measure 
difference between pp and pq at a finite number of points, provided our kernel is characteristic 
and analytic. In the next section, we will see that metrization of the space of probability measures 
using random metrics d^^j is very appealing from the computational point of view. It turns 
out that the statistical tests that arise from those metrics have linear time complexity (in the number 
of samples) and constant memory requirements. 

3 Hypothesis Tests Based on Distances Between Analytic Functions 


In this section, we provide two linear-time two-sample tests: first, a test based on analytic mean 
embeddings, and then a test based on smooth characteristic functions. We further describe the 
relation with competing alternatives. Proofs of this chapter’s propositions are in the Appendix [B] 

Difference in analytic functions In the previous section we described the random metric based 
on a difference in analytic mean embeddings, d^ j{P, Q) = j ^j=i{Tp{'Pj) ~ TQ{Pj))^- If we 
replace pp with the empirical mean embedding pp = ^ ’) it be shown that for any 

sequence of unique under the null hypothesis, as n —oo, 

J 

AnY^ippitA - PQ{tj)f ( 11 ) 

i=i 


converges in distribution to a sum of correlated chi-squared variables. Even for fixed {tj} j=i 7 it is 
very computationally costly to obtain quantiles of this distribution, since this requires a bootstrap 
or permutation procedure. We will follow a different approach based on Hotelling’s T^-statistic 
ca. The Hotelling’s -squared statistic of a normally distributed, zero mean, Gaussian vector 
W = (Wi, • • • , Wj), with a covariance matrix S, is = WT,~^W. The compelling property of 
the statistic is that it is distributed as a x^-random variable with J degrees of freedom. To see a link 
between and equation ( 111 , consider a random variable distributed as a 

sum of correlated chi-squarea variables. In our case W is replaced with a difference of normalized 
empirical mean embeddings, and E is replaced with the empirical covariance of the difference of 
mean embeddings. Formally, let Zi denote the vector of differences between kernels at tests points 
T-, 

Z, = ikiX„T,)-kiY,,T,),--- ,k(X,,Tj)-k(Y,,Tj))GR'’. (12) 


We define the vector of mean empirical differences ^ II® covariance matrix 

S„ = -ZZ"^. The test statistic is 

Sn = nWnY^Wn. ( 13 ) 


The computation of Sn requires inversion of a J x J matrix S„, but this is fast and numerically 
stable: J will typically be small and is in our experiments less than 10. The next proposition 
demonstrates the use of as a two-sample test statistic. 

Proposition 2 (Asymptotic behavior of 5'„). Let d^ j{P, Q) — 0 a.s. and let and 

be i.i.d. samples from P and Q respectively. Then the statistic Sn is a.s. asymptotically distributed 
as a x^-fandom variable with J degrees of freedom (as n ^ oo with d fixed). If d^^ j{P, Q) > 0 
a.s., then a.s. for any fixed r, P(S'„ > r) —>■ 1 ai n —>■ oo . 
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We now apply the above proposition in obtaining a statistical test. 

Test 1 (Analytic mean embedding). Calculate 5'„. Choose a threshold corresponding to the 1—a 
quantile of a distribution with J degrees of freedom, and reject the null hypothesis whenever Sn 
is larger than Tq,. 

There are a number of valid sampling schemes for the test points to evaluate the differences 

in mean embeddings: see Sectionj^for a discussion. 

Difference in smooth characteristic functions From the convolution definition of a smooth char¬ 
acteristic function Q it is not clear how to calculate its estimator in linear time. However, we show 
in the next proposition that a smooth characteristic function can be written as an expected value of 
some function with respect to the given measure, which can be estimated in a linear time. 

Proposition 3. Let k be an integrable translation-invariant kernel and f its inverse Fourier trans¬ 
form. Then the smooth characteristic function of P can be written as fplf) = fjidP* ^ f{x)dP{x). 

It is now clear that a test based on the smooth characteristic functions is similar to the test based on 
mean embeddings. The main difference is in the definition of the vector of differences Zp. 

z, = (/(x,)sin(x,ri)-/(r,)sin(r,Ti),/(x,)cos(x,ri)-/(y,)cos(y,ri),---) e R"*" (i4) 

The imaginary and real part of the ^'f(Yi) are stacked together, in order 

to ensure that Wn, Sn and Sn as all real-valued quantities. 

Proposition 4. Let j{P,Q) = 0 and let and {Yi}f^^ be i.i.d. samples from P and 

Q respectively. Then the statistic Sn is almost surely asymptotically distributed as a x^-fcmdom 
variable with 2J degrees of freedom (as n —>■ oo with J fixed). If d^ j{P,Q) > 0 , then almost 
surely for any fixed r, P(Sn > r) —>■ 1 as n —>■ oo. 

Other tests. The test Q based on empirical characteristic functions was constructed originally 
for one test point and then generalized to many points - it is quite similar to our second test, but 
does not perform smoothing (it is also based on a T^-Hotelling statistic). The block MMD OTl is 
a sub-quadratic test, which can be trivially linearized by fixing the block size, as presented in the 
Appendix. Finally, another alternative is the MMD, an inherently quadratic time test. We scale 
MMD to linear time by sub-sampling our data set, and choosing only yTi points, so that the MMD 
complexity becomes 0(n). Note, however, that the true complexity of MMD involves a permutation 
calculation of the null distribution at cost 0(&„n), where the number of permutations grows with 
n. See Appendix[C]for a detailed description of alternative tests. 

4 Experiments 

In this section we compare two-sample tests on both artificial benchmark data and on real-world 
data. We denote the smooth characteristic function test as ‘Smooth CF’, and the test based on the 
analytic mean embeddings as ‘Mean Embedding’. We compare against several alternative testing ap¬ 
proaches: block MMD (‘Block MMD’), a characteristic functions based test (‘CF’), a sub-sampling 
MMD test (‘MMD(y^)’), and the quadratic-time MMD test (‘MMD(n)’). 

Experimental setup. For all the experiments, D is the dimensionality of samples in a dataset, n 
is a number of samples in the dataset (sample size) and J is number of test frequencies. Parameter 
selection is required for all the tests. The table summarizes the main choices of the parameters made 
for the experiments. The first parameter is the test function, used to calculate the particular statistic. 
The scalar 7 represents the length-scale of the observed data. Notice that for the kernel tests we 

recover the standard parameterization exp(—1|^ — ^|p) = exp(—). The original CF test 
was proposed without any parameters, hence we added 7 to ensure a fair comparison - for this test 
varying 7 is equivalent to adjusting the variance of the distribution of frequencies Tj . For all tests, 
the value of the scaling parameter 7 was chosen so as to maximize test power on a held-out training 
set: details are described in Appendix [P] We chose not to optimize the sampling scheme for the 
Mean Embedding and Smooth CE tests, since this would give them an unfair advantage over the 
Block MMD, MMD(y^) and CE tests. The block size in the Block MMD test and the number of 
test frequencies in the Mean Embedding, Smooth CE, and CE tests, were always set to the same 
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— Smooth CF 

— Block MMD 
CF 

— Mean Embedding 

— MMD(/7r) 

— MMD(n) 



Figure 1; Higgs dataset. Left: Test power vs. sample size. Right: Test power vs. execution time. 


value (not greater than 10) to maintain exactly the same time complexity. Note that we did not use 
the popular median heuristic for kernel bandwidth choice (MMD and B-test), since it gives poor 
results for the Blobs and AM Audio datasets El. We do not run MMD(n) test in the ’Simulation 
1’ and on the ’Amplitude Modulated Music’ since the sample size is 10000, i.e., too large for a 
quadratic-time test with permutation sampling for the test critical value. 

It is important to verify that Type I err or is indeed at the design level, set at a = 0.05 in this paper. 
This is verified in the Appendix figure [a! 2| Also shown in the plots is the 95% percent confidence 
intervals for the results, as averaged over 4000 runs. 


Test 

Test Eunction 

Sampling scheme 

Other parameters 

Mean Embedding 
Smooth CE 
MMD(n),MMD(v^) 
Block MMD 

CE 

exp(-ll^-fll^) 

exp(*fT^-||^-ff) 

exp(-llf-ff) 

exp(-||2 -^f) 

exp(if^f) 

Tj ~ 7V(0d, Id) 

Tj ~ N{0d, Id) 

not applicable 
not applicable 

Tj ~ N{0d. Id) 

J - no. of test frequencies 

J - no. of test frequencies 
b -bootstraps 
R-block size 

J - no. of test frequencies 


Real Data 1: Higgs dataset, D = A, n varies, J = 10. The first experiment we consider is on 
the UCI Higgs dataset ifTTl described in 12 - the task is to distinguish signatures of processes which 
produce Higgs bosons from background processes which do not. We consider a two-sample test on 
certain extremely low-level features in the dataset - kinematic properties measured by the particle 
detectors, i.e., the joint distributions of the azimuthal angular momenta (p for four particle jets. We 
denote by P the jet (^-momenta distribution of the background process (no Higgs bosons), and by 
Q the corresponding distribution for the process that produces Higgs bosons (both are distributions 
on R'*). As discussed in |12 Fig. 2], (/^-momenta, unlike transverse momenta pt, carry very little 
discriminating information for recognizing whether Higgs bosons were produced or not. Therefore, 
we would like to test the null hypothesis that the distributions of angular momenta P (no Higgs boson 
observed) and Q (Higgs boson observed) might yet be rejected. The results for different algorithms 
are presented in the Figure We observe that the joint distribution of the angular momenta is in 
fact a discriminative feature. Sample size varies from 1000 to 12000. The Smooth CF test has 
significantly higher power than the other tests, including the quadratic-time MMD, which we could 
only run on up to 5100 samples due to computational limitations. The leading performance of 
the Smooth CF test is especially remarkable given it is several orders of magnitude faster then the 
quadratic-time MMD(n), which is both expensive to compute, and requires a costly permutation 
approach to determine the significance threshold. 

Real Data 2: Amplitude Modulated Music, D = 1000, n = 10000, J — 10. Amplitude mod¬ 
ulation is the earliest technique used to transmit voice over the radio. In the following experiment 
observations were one thousand dimensional samples of carrier signals that were modulated with 
two different input audio signals from the same album, song P and song Q (further details of these 
data are described in ifTOl Section 5]). To increase the difficulty of the testing problem, independent 
Gaussian noise of increasing variance (in the range 1 to 4.0) was added to the signals. The results 
are presented in the Figure Compared to the other tests, the Mean Embedding and Smooth CF 
tests are more robust to the moderate noise contamination. 
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Figure 2: Music Dataset.Left: Test power vs. added noise. Right: four samples from P and Q. 


— Smooth CF 

— Block MMD 
CF 

— Mean Embedding 

— MMD(vSr) 



Figure 3: Power vs. redundant dimensions comparison for tests on high dimensional data. 


Simulation 1: High Dimensions, D varies, n = 10000, J = 3. It has been recently shown, 
in theory and in practice, that the two-sample problem gets more difficult as the number of the 
dimensions increases on which the distributions do not differ 121] [23. In the following experiment, 
we study the power of the two-sample tests as a function of dimension of the samples. We run two- 
sample test on two datasets of Gaussian random vectors which differ only in the first dimension. 

Dataset I; P = N{QdJd) vs. Q = TV ((1,0, • • • , 0),/_d) 

Dataset II: P = N{QdJd) vs. Q = iV (Oc, diag((2,1, • • • , 1))), 

where 0^ is a H-dimensional vector of zeros. Id is a H-dimensional identity matrix, and diag(u) 
is a diagonal matrix with v on the diagonal. The number of dimensions (D) varies from 50 to 
1000 (Dataset I) and from 50 to 2500 (Dataset II). The power of the different two-sample tests is 
presented in Figure]^ The Mean Embedding test yields best performance for both datasets, where 
the advantage is especially large for differences in variance. 

Simulation 2: Blobs, D = 2, n varies, J = 5. The Blobs dataset is a grid of two dimensional 
Gaussian distributions (see Figure]^, which is known to be a challenging two-sample testing task. 
The difficulty arises from the fact that the difference in distributions is encoded at a much smaller 
lengthscale than the overall data. In this experiment both P and Q are a four by four grid of Gaus- 
sians, where P has unit covariance matrix in each mixture component, while each component of Q 
has a non unit covariance matrix. It was demonstrated by na that a good choice of kernel is crucial 
for this task. Figure [^presents the results of two-sample tests on the Blobs dataset. The number of 
samples varies from 50 to 14000 ( MMD(n) reached test power one with n = 1400). We found that 
the MMD(n) test has the best power as function of the sample size, but the worst power/computation 
tradeoff. By contrast, random distance based tests have the best power/computation tradeoff. 
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— Smooth CF 

— Block MMD 
CF 

— Mean Embedding 

— MMD(^^) 

— MMD(n) 









Figure 4: Blobs Dataset. Left: test power vs. sample size. Center: test power vs. execution time. 
Right: illustration of the blob dataset. Each mixture component in the upper plot is a standard 
Gaussian, whereas those in the lower plot have the direction of the largest variance rotated by 7r/4 
and amplified so the standard deviation in this direction is 2. 
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A Figures 


Characteristic Function distance 


Smooth Characteristic Function distance Analytic Mean Embedding distance 




Figure A.l; Smooth vs non-smooth. Left; pseudo-distance d^^i{P,Q) which uses a single fre¬ 
quency t G as a function of this frequency; Middle; depicted in the same way; 

Right; dfj,^i{P, Q) which uses a single location t G as a function of this location. The measures 
P, Q used are illustrated in Figure]^- these are grids of Gaussian distributions discussed in detail in 
Section |4] 



Figure A.2; Type I error of the blobs dataset (left) and the dimensions dataset (right). The dashed 
line is the 99% Wald interval a ± 2.57\/a(l — Q!)/4000 (4000 is number of repetitions) around the 
design test size of a = 0.05. 


B Proofs 

Proof of Proposition U 

Proof. For some / = /(e), there exists an interval [—/,/] with measure 1 — (1 — e) 7. Dehne 
fw{i) = 1 — w\t\ for ry > j and zero elsewhere. By Polya’s theorem, A = {fw}w>} 
uncountable family of characteristic functions that are the same on the complement of [—/, /], which 
has measure (1 — e) j. For wi > W 2 > j, fwi ^ fw 2 in some neighborhood of 1/wi, hence the 
measures associated with those characteristic functions are different. The probability that all Ti sit 

in the complement of interval [—/,/] is ^(1 — = (1 — g) and such an event implies that 

□ 


Proof of Theorem 12 

First we give a proposition that characterizes limits of analytic functions. 

Proposition 5(0 Proposition 3]). If{fn} ^ sequence of real valued, uniformly bounded analytic 
functions on R'^ converging pointwise to f, then f is analytic. 

Now we characterize the RKHS of an analytic kernel. Similar results were proved for specihc classes 
of kernels in ||29] Theorem 1], ||28] Corollary 3.5]. 
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Lemma 1 . If k is a bounded, analytic kernel on x then all functions in the RKHS Hk 
associated with this kernel are analytic. 

Proof Since R'^ is separable then by lIZTl Lemma 4.33] Hilbert Space T-Lk is separable. By Moore- 
Aronszajn Theorem I?) there exist a set iJo of linear combinations of functions k{-,x),x G R'^, 
which is dense in Hk and Hk is a set of functions which are pointwise limits of Cauchy sequences 
in Hq. For each / G Hk let {/„} G Ho be a sequence of functions converging in the Hilbert Space 
norm to /. Since {fn} is convergent there exists N such that Vn > N \\fn — f\\ < 1. For all n 
there exist a uniform bound on /„ norm 

ll/nll = ll/n - f + f\\< Wfn - f\\ + \\f\\ < max( 1, ^muj^ ||/at ||) + ||/||. (15) 

Since k is bounded, by the ll27l Lemma 4.33], there exists C such that for any / G Hk, ||/||oo < 
C||/||. Therefore for all n 

ll/nlloo < C'max(l,^ma3^||/jv||) + C||/||. (16) 

Finally, using Proposition]^ we conclude that / is analytic. This concludes the proof of Lemma[2 

□ 


Next, we show that analytic functions are ’well behaved’. 

Lemma 2. Let fj, be absolutely continuous measure on R'^ (wrt. the Lebesgue measure). Non-zero, 
analytic function f can be zero at most at the set of measure 0, with respect to the measure p,. 

Proof. If / is zero at the set with a limit point then it is zero everywhere. Therefore / can be zero at 
most at a set A without a limit point, which by definition is a discrete set (distance between any two 
points in A is greater then some e > 0). Discrete sets have zero Lebesgue measure (as a countable 
union of points with zero measure). Since P is absolutely continuous then p{A) is zero as well. □ 

Next, we show how to construct random distances. 

Lemma 3. Let A be an injective mapping from the space of the probability measures into a space 
of analytic functions on R'^. Define 

2 

dl.AP^ Q) = E I 

f=i 

where are real valued i.i.d. random variables from a distribution which is absolutely con¬ 

tinuous with respect to the Lebesgue measure. Then, d\ j{P, Q) is a random metric. 

Proof. Let AP and KQ be images of measures P and Q respectively. We want to apply Lemma 
l^to the analytic function / = AP — KQ, with the measure p = pT^, to see that if P 7 ^ Q then 
j f 0 a.s. To do so, we need to show that P Q implies that / is non-zero. Since mapping to A is 
injective, there must exists at least one point o where / is non-zero. By continuity of /, there exists 
a ball around o in which / is non-zero. 

We have shown that P f Q implies / 7 ^ 0 a.s. which in turn implies that dj^ j{P, Q) > 0 a.s. If 
P — Q then f — 0 and dA,j{P, Q) = 0. 

By the construction dA,,j{P, Q) = dA,j{Q, P) and for any measure U, dA.j{P, Q) < dA,j{P, U) 4- 
dA,j{U, Q) a.s. since the triangle inequality holds for any vectors in R"^. □ 

We are ready to proof Theorem]^ 

Proof of Theorem^ Since k is characteristic the mapping K : P ^ pp is injective. Since k is a 
bounded, analytic kernel on R^^ x R”^, the Lemma[^guarantees that pp is analytic, hence the image 
of A is a subset of analytic functions. Therefore, we can use Lemmal^to see that dA,j{P, Q)^ = 
j(P, Q)^ is a random metric. This concludes the proof of Theorem® □ 
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Proof of Theorem [I] 

We first show that smooth characteristic functions are unique to distributions. 

Lemma 4. If I is an analytic, integrable, translation invariant kernel with an inverse Fourier trans¬ 
form strictly greater then zero and P has integrable characteristic function, then the mapping 

A : P —> (j)p 

is injective and fp is element of the RKHS FLi associated with 1. 


Proof For the integrable characteristic function tp we define a functional L : T-Li ^ R given by 
formula 

Lf = f p{x)f{x)dx (17) 

jR'* 

Since L{f + g) = L{f ) + L{g), L is linear. We check that L is bounded; let B = {f ^ FLi :\\ f \\< 
1} be a unit ball in the Hilbert Space. 


sup IL/1 < sup / p{x)f {x)dx < swp / (p{x)\\f\\l{x,x)dx = / p{x)l{x,x)dx < 

feB feBJ-Rd f^BJ-R.'i jRd 


( 18 ) 


By Riesz representation Theorem there exist f £ H such that {f, f) = (p{x)f{x)dx. By 
reproducing property f is given by equation cj){x) = {(j),l{t ,)) = l{x,t)(p{x)dx. With each 
probability measure P with an integrable characteristic function (pp we associate the smooth char¬ 
acteristic function with 

P ^ 4>p{x) = / l{x,f)pp{x)dx (19) 

jR'* 


We will prove that P —>■ fp is injective. We will show that, Vxfqix) = (f>p{x) implies P = Q. 

4’Q = 4’p ^ / l{x — t)pp{x)dx = / l{x — t)pQ{x)dx. (20) 

7R<i Jr<* 

We apply inverse Fourier transform to this convolution and get 

9 {x)fx{x) = fY{x)g{x) (21) 

Where g = T~^l, fy = T~^pQ and fx = T~^pp. Since inverse Fourier transform is injective 
on the space of the integrable characteristic functions, and all fpp,pQ are integrable CFs, then 
application of the inverse Fourier transform does not enlarge the null space of Eq. ( |20| ). Since 
g{x) > 0, fx{x) = fyi^x) everywhere, implying that the mapping P ^ fp is injective. This 
concludes the proof of Lemma|^ 

□ 


Next, we show that smooth characteristic function is analytic. 

Lemma 5. If I is an analytic, integrable kernel with an inverse Fourier transform strictly greater 
then zero and P has an integrable characteristic function then the smooth characteristic function 
(j)p is analytic. 

Proof By lemma all functions in the RKHS associated with I are analytic and by|^(^p is an 
element of this RKHS. □ 

We are ready to proof Theorem [T] 

Proof of Theorem^ Since I is an analytic, integrable kernel with an inverse Fourier transform 
strictly greater then zero then by the Lemma 0 the mapping A : P ^ fp is injective and A(P) 
is an element of the RKHS associated with 1. The Lemma [5| shows that pp is analytic. Therefore 
we can use Lemmato see that d\j{P, Q)^ = d^j{P, Q)^s a random metric. This concludes the 
proof of Theorem □ 
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Proof of Lemma |3] 


Proof. By Fubini’s theorem we get 

fp{t)= / (pp{t- w)f{w)dv 


i{t-wyx 


\JR'^ 

[ 


dP{x) ) f{w)dw 


'R‘^ 


e-™ ^f{w)dw]dP{x) 


Use of Fubini’s theorem is justified, since the iterated integral is finite ll23l [Theorem 8.8 b] i.e. 


y{t-wy X 


Ir<^ Jr'^ 


f{w)\dP{x)dw 


[ l/(^)l / ldP{x)dw < oo. 
Ir'^ Jr‘‘ 


□ 


Proof of Proposition]^ 

Proof. The probability space of random variables and is a product space i.e 

sequence of T_,’s is defined on the space (Ui, Pi) and the sequence of Xfs is defined on the 
space (U 2 , J^ 2 ) ^ 2 )- We will show that for almost all w S Ui, S'„ converges to distribution with 
J degrees of freedom. We define 

Zf = (fc(X„Ti(a;))- fc(r„ri(cc)),... ,fc(X„TjH) - k{Y,,Tj{u:))) G . (22) 

If there exist a f b, such that Ta{oj) = Ti,{uj), then we set Zf = 0. Otherwise, if EZf = 0 
then y'nW))) = ^ converges to a multivariate Gaussian vector with covariance matrix 

YP = '¥.Zf [Zf)^ (the variance Zf is finite so we use standard multivariate CLT). Therefore 
\im.n^aon{Wf)^ has asymptotically distribution with J degrees of freedom (by 
the CLT and Slutsky’s theorem). Consider 

( J \ 

■ (23) 

If j(P, Q) = 0 then for all j, {fj,p{Tj{uj)) = y,Q{Tj{uj)), which implies that = 0. 

IfEZ“ ^ Othen 

P{Sf >r) = P (iW :)^- I > 0 ) ^ 1. (24) 

To see that, first we show that converges in probability to the positive definite matrix 

jjideed, each entry of the matrix converges to the matrix S“, hence entires of the ma¬ 
trix given by a continuous function of the entries of Y‘^, are limit of the sequence 

Similarly Wf converges in probability to the vector . Since (W ‘^)^> 0 
((j^-i)w positive definite), then being a continuous function of the entries 

of Wf and converges to a‘^. On the other hand ^ converges to zero and the proposition 

follows. Finally since j{P, Q) > 0 almost surely then ¥.Zf 7 ^ 0 for almost all w G Oi. 

We have showed that the proposition hold for almost all uj. Indeed it does not hold if it happens that 
for some a f b, Taioj) = Ti,{uj) or j{P, Q) = 0 for P f Q. But both those events have zero 
measure. 

□ 
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Proof of Proposition!^ The poof is analogue to the proof of the Proposition]^ 


C Other tests 

C.l Quadratic-time MMD test 

For two measures P, Q the population MMD can be written as 

MMD{P,QY = J k{x,x')dP{x)dP{x')— 2 J k{x,y)dP{x)dP{y) + j k(y,y')dP{y)dP{y'). 

An MMD-based test uses as its statistic an empirical estimator of the squared population MMD, and 
rejects the null if this is larger than a threshold corresponding to the 1 — a quantile of the null 
distribution. The minimum variance unbiased estimator of MMD is 

MMDl = i 

Hx, X , y, y ) = k{x, x') + k{y, y) - k{x, y') - k{x , y). 

The test threshold Xa is costly to compute. The null distribution of MMD"^ is an infinite weighted 
sum of chi-squared random variables, where the weights are eigenvalues of the kernel with respect 
to the (unknown) distribution P. A bootstrap or permutation procedure may be used in obtaining 
consistent quantiles of the null distribution, however the cost is 0(6„n^) if we have bn permutations 
and n data points (bn is usually in the hundreds, at minimum). As an alternative consistent proce¬ 
dure, the eigenvalues of the joint gram matrix over samples from P and Q may be used in place of 
the population eigenvalues; the fastest quadratic-time MMD test uses a gamma approximation to the 
null distribution, which works well most of the times, but has no consistency guarantees i). 


C.l Sub-quadratic time MMD test 


An alternative to the quadratic-time MMD test is a B-test (block-based test); the idea is to break the 
data into blocks, compute a quadratic-time statistic on each block, and average these quantities to 
obtain the test statistic. More specifically, for an individual block, laying on the main diagonal and 
starting at position (i — 1)B -f 1, the statistic r]{i) is calculated as 

. iB iB 

= E E h{Xn,Xh,Yn,Yh). (25) 

V2/ a=(i-l)B + l b=(i-l)B-|-l/a 


The overall test statistic is then 



1=1 


(26) 


The choice of B determines computation time - at one extreme is the linear-time MMD suggested 
by ISl fTOl where we have n/2 blocks of size B = 2, and at the other extreme is the usual full MMD 
with 1 block of size n, which requires calculating the test statistic on the whole kernel matrix in 
quadratic time. In our case, the size of the block remains constant as n increases, and is greater than 
2. This is very similar to the case proposed by 1211 , and the consistency of the test is not affected. 

B-test of im assumes that B ^ oo together with n, which implies that the statistic rj defined in 
( |2^ under the null distribution satisfies 

VM3fi^J\f{0,4a^), (27) 


for asymptotic variance (Tg = '¥.xx'k'^{X,X')-\-{Kxx'k{X^X'))^ — 2'¥.x {^x'k{X, X'))^ that 

can easily be estimated directly or by considering the empirical variance of the statistics computed 
within each of the blocks. Note that the same asymptotic variance Cg is obtained in the case of a 
quadratic-time statistic ID - albeit convergence rate being a faster 0(l/n) in that case. Indeed, ( |27| ) 

is obtained directly from the leading term of the variance of each block-based statistic being 
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Figure D.3: Box plot of p-values used for parameter selection. The X axis shows the binary loga¬ 
rithm of the scaling parameter applied to data. We have chosen the scaling with the smallest median. 
If the medians were similar we have used the one that had less outliers and was surrounded with other 
scalings with small p-value. In the example we have chosen 2°.0 scaling for the B-test, ° scaling 
for the Smoothed CF and 2“^° ° scaling for the CF test. 


Therefore, the p-value for B-test is approximated as T* , where $ is the standard normal 

cdf. When B remains constant as n increases, it can be shown that the variance of each block-based 
statistic is exactly and thus we obtain by CLT that 

Therefore, a slight change to p-value needs to be applied when CTq is estimated directly; 
$ however, one simply uses the empirical variance of the individual statistics 

computed within each block, the procedure is unaffected. 

D Parameters Choice 

We split our data set into two disjoint sets, training and testing set, and optimize parameters on the 
training set. We didn’t come up with an automated testing procedure, instead we plotted the p-values 
of tests for different scales. The figure [^presents such a plot for three different tests. The p-values 
were obtained by running the test several times (20 to 50) for each data scaling A. Note that in 
the case of simulations we just generated new training dataset for each repetition for a given data 
scaling. For the music dataset we generated new noises for each scaling and for the Higgs dataset 
we have used bootstrap. The last method is applicable to real life problems i.e. we split our data into 
training and test part and then bootstrap from the training part. 
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